PAN 2017: Author Profiling - Gender and Language Variety Prediction

Authors

Matej Martinc

Iza Skrjanec

Katja Zupan

Senja Pollak

Abstract

We present the results of gender and language variety identification performed on the tweet corpus prepared for the PAN 2017 Author Profiling shared task. Our approach consists of tweet preprocessing, feature construction, feature weighting and classification model construction. We propose a logistic regression classifier whose main features are different types of character and word n-grams. Additional features include POS n-grams, emoji and document sentiment information, character flooding and language variety word lists. Our model achieved its best results on the Portuguese test set in both prediction tasks, gender and language variety, with accuracies of 0.8600 and 0.9838, respectively. The worst accuracy was achieved on the Arabic test set.
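The feature construction step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram sizes (character trigrams, word bigrams), the whitespace tokenization, the definition of character flooding as three or more repetitions of the same character, and all function names are assumptions made for illustration. Feature weighting and the logistic regression classifier itself are omitted.

```python
import re
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word n-grams after lowercasing and whitespace splitting."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Assumed definition of flooding: the same character three or more times in a row.
FLOOD_RE = re.compile(r"(.)\1{2,}")


def flooding_count(text):
    """Count runs of a repeated character (e.g. 'sooo', '!!!')."""
    return sum(1 for _ in FLOOD_RE.finditer(text))


def build_features(tweet):
    """Merge the feature groups into one sparse feature dictionary."""
    features = {}
    # Character trigrams; the size 3 is an illustrative choice.
    for gram, count in char_ngrams(tweet, 3).items():
        features["c3=" + gram] = count
    # Word bigrams; the size 2 is an illustrative choice.
    for gram, count in word_ngrams(tweet, 2).items():
        features["w2=" + gram] = count
    features["flooding"] = flooding_count(tweet)
    return features


example = build_features("sooo happy today!!!")
```

A dictionary of this shape could then be vectorized, weighted, and passed to any linear classifier; the other feature groups the abstract lists (POS n-grams, emoji, sentiment, word lists) would be added to the same dictionary in the same way.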

Similar resources

INSA LYON and UNI PASSAU's Participation at PAN@CLEF'17: Author Profiling task

This paper describes the participation of INSA Lyon and UNI Passau at the PAN 2017 Author Profiling task. Given the language and tweets from an author, the goal is to predict his/her gender and language variety. We consider two strategies: a "loose" classification that learns one predictive model for the gender and another one for the variety, and a "successive" classification that first predi...

A Document Weighted Approach for Gender and Age Prediction Based on Term Weight Measure

Author profiling is a text classification technique, which is used to predict the profiles of unknown text by analyzing their writing styles. Author profiles are the characteristics of the authors like gender, age, nativity language, country and educational background. The existing approaches for Author Profiling suffered from problems like high dimensionality of features and fail to capture th...

Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter

This overview presents the framework and the results of the Author Profiling task at PAN 2017. The objective of this year is to address gender and language variety identification. For this purpose a corpus from Twitter has been provided for four different languages: Arabic, English, Portuguese, and Spanish. Altogether, the approaches of 22 participants are evaluated.

Including Dialects and Language Varieties in Author Profiling

This paper presents a computational approach to author profiling taking gender and language variety into account. We apply an ensemble system with the output of multiple linear SVM classifiers trained on character and word ngrams. We evaluate the system using the dataset provided by the organizers of the 2017 PAN lab on author profiling. Our approach achieved 75% average accuracy on gender iden...

Using Character n-grams and Style Features for Gender and Language Variety Classification

Author profiling is the problem of determining the characteristics of an author of an anonymous text. In this paper, we detail a method to determine the language variety and the gender of the authors of tweets, as a submission for the Author Profiling Task at PAN 2017. This method seeks to select the most significant character n-grams for each class considered, combining them with style feature...

Publication date: 2017
